General comments:
.Rmd, and the generated output file .html or .pdf on Canvas before the due date/time..Rmd file compiles without any errors. The marker will not spend time fixing the bugs in your code.tidyverse packages to answer the questions in this assignment. Please use dplyr for data wrangling/manipulation, ggplot2 for data visualisation, and lubridate for dates/times. Some parts of Problem 2 will use plots from the fpp3 packages.Due: Friday 15 March 2023 at 16:00 PM (NZ time)
The lakers data set (in the lubridate package) contains play-by-play statistics of each Los Angeles Lakers basketball game in the 2008-2009 regular season. It contains the following variables:
| Variable | Description |
|---|---|
date |
Date of the game |
opponent |
Name of the opposition team |
game_type |
Home or away game |
time |
Time remaining on the game clock in a given period (counting down from 12 minutes) |
period |
The period of play (most games have four quarters, each 12 minutes in duration, noting that some games go into a 5-minute duration overtime if tied at the end of regular play) |
etype |
The type of play made (e.g., shot, turnover, rebound) |
team |
Name of the NBA team the player who made the play belongs to |
player |
Name of the player that the play was made by |
result |
Whether they won or lost the game |
type |
A more detailed description of the type of play made |
x |
The \(x\)-coordinate on the field of play (in ft) |
y |
The \(y\)-coordinate on the field of play (in ft) |
- 6 Marks
- Read in the
lakersdata set and convert this into atibbleobject.- Keep only the rows relating to Kobe Bryant. Name this object
kobe.- Transform the
datevariable into alubridatedate format (noting that it is currently in integer format).- Shot location is given by
xandy. The center of the hoop is located at the coordinates \((25, 5.25)\). Center thexandyvariables to \((0, 0)\); you will want to overwritexandyin yourkobedata set.
lakers <- as_tibble(lakers)
kobe <- lakers %>%
filter(player == "Kobe Bryant")
kobe$date <- ymd(kobe$date)
kobe$x <- kobe$x - 25
kobe$y <- kobe$y - 5.25
- 6 Marks
- Subset the
kobedata set by only considering plays that are shot attempts (i.e., whereetypeis equal toshot). Name this new data setkobe.shot.- Make a scatter plot of the centered shot location, colouring the points by
result. You should use thegeom_pointlayer.- Set the transparency of the points to
alpha = 0.5.- Use the default colour scheme, but reverse the colour order so that shots made is green(ish) and shots missed is red(ish). Hint: You can use
scale_colour_discretewith an additional argument.
kobe.shot <- kobe %>%
filter(etype == "shot")
kobe.shot %>%
ggplot(mapping = aes(x = x, y = y, colour = result)) +
geom_point(alpha = 0.5) +
scale_colour_discrete(direction = -1) +
theme_minimal() +
labs(x = "loc_x(ft)",
y = "loc_y(ft)",
colour = "result",
title = "Scatter Plot of the centered shot location")
- 6 Marks
- Using the
kobe.shotdata set, produce a 2-dimensional density plot (with contours) of Kobe Bryant’s shot locations. You will want to use bothgeom_density_2d_filledandgeom_density_2d. Do not colour byresult.- Remove the legend using
legend.positionargument in thethemelayer.
kobe.shot %>%
ggplot(mapping = aes(x = x, y = y)) +
geom_density_2d_filled() +
geom_density_2d() +
theme(legend.position = "none") +
labs(x = "loc_x(ft)",
y = "loc_y(ft)",
title = "2-dimensional density plot of the centered shot location")
- 9 Marks
- Within the
kobe.shotdata set, create a variable calleddistancethat calculates the distance a shot was taken from hoop. You will need to use Pythagoras’ theorem, i.e., \(\text{distance} = \sqrt{x^2 + y^2}\).- Then create another variable within your
kobe.shotdata set calledindicatorthat concatenates the values ofresultwithgame_type. Hint: You can use thepastefunction. You should end up with a variable in your data set that takes on the four values: “made home”, “made away”, “missed home”, “missed away”.- Plot histograms showing the distribution of distance using
geom_histogram. Usefacet_wrapto create seperate panels for all values ofindicator. (You should end up with four panels on the same figure).- Fill the histograms by
indicatorsuch that the interior of the bars are different colours for the four different groups.
- Remove the legend.
kobe.shot %>%
mutate(distance = sqrt(x^2 + y^2),
indicator = paste(result, game_type)) %>%
ggplot(mapping = aes(x = distance, fill = indicator)) +
geom_histogram(bins = 30) +
theme(legend.position = "none") +
facet_wrap(~indicator) +
labs(x = "Distance(ft)",
y = "Frequency",
fill = "indicator",
title = "Histograms of distance")
- 11 Marks
- Subset the original
kobedata set (not thekobe.shotsdata set) by considering plays that are only free throws (i.e., whereetypeis equal tofree throw). Call this new data setkobe.free.- Within the
kobe.freedata set, calculate the total number of points from free throws per game as well as the free throw percentage per game. You will want to use thegroup_by,summarise,sum, andnfunctions.- Plot Kobe Bryant’s free throw percentage per game using
geom_segmentto create vertical line segments from 0 to the free throw percentage. Your \(x\)-axis should bedateand your \(y\)-axis should be free throw percentage.- Add transparency proportional to the total number of points per game. (i.e., a larger number of points should have darker line segments).
kobe.free <- kobe %>%
filter(etype == "free throw")
kobe.free.points <- kobe.free %>%
group_by(date) %>%
summarise(total_points = sum(points),
percentage = total_points / n())
kobe.free.points %>%
ggplot(mapping = aes(x = date, y = percentage)) +
geom_segment(aes(xend = date,y = 0, yend = percentage, alpha = total_points)) +
theme_minimal() +
labs(title = "Histograms of Kobe Bryant's free throw percentage per game")
- 7 Marks
- Using the
kobedata set, find the unique dates that Kobe Bryant played in the 2008-2009 regular season. Hint: You will want to use thegroup_by,summarise, andn_distinctfunctions. You should end up with a data set with 78 rows. Name this data setkobe.week.- Then create a variable that tells you the day of the week the game was played. You will need an appropriate
lubridatefunction.- Plot a bar chart that shows the frequency of games played on each of the seven days of the week.
- Comment on the most common and least common game days.
The most common day is Tuesday, and the least common day is Monday and Saturday.
kobe.week <- kobe %>%
group_by(date) %>%
summarise(n_distinct(date))
kobe.week %>%
mutate(week = wday(date, label = TRUE)) %>%
ggplot(mapping = aes(x = week)) +
geom_bar() +
labs(x = "day",
y = "frequency",
title = "barplot")
Total possible marks for Problem 1: 45 Marks
The data set auckland_temps.csv contains the monthly average temperatures in Auckland from July 1994 until January 2024. [Data source: https://cliflo.niwa.co.nz]
- 5 Marks
- Read in the data using
read_csv(don’t useread.csv).
- Convert the
Monthvariable into the correct date format. Hint: You will need to use a function from thetsibblepackage.- Coerce your
tibbleto atsibbleobject withMonthas the index.
data_csv <- read_csv("auckland_temps.csv")
## Rows: 355 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Temperature
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_tsibble <- data_csv %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month)
- 7 Marks
- Create a time plot, seasonal plot, and subseries plot of the data.
- Comment on the seasonality in the plots. Which month has the highest average temperatures, and which month has the lowest?
- Comment on whether there is a trend in the data and if so, in what direction.
Febrary has the highest average temperatures, July has the lowest. Yes. The temperature undergoes seasonal changes, being low in winter, gradually rising in spring, reaching its peak in summer, and then declining in autumn.
data_tsibble %>%
autoplot(Temperature) +
labs(x = "time",
y = "temperature",
title = "time plot")
data_tsibble %>%
gg_season(Temperature, labels = "none") +
labs(y = "temperature",
title = "seasonal plot")
data_tsibble %>%
gg_subseries(Temperature) +
labs(x = "time",
y = "temperature",
title = "subseries plot")
- 8 Marks
- Create a lag plot for the first 12 lags. Use the
pointgeometry and setalpha = 0.5.- Write a sentence or two explaining what autocorrelation is.
- Comment on the patterns you observe in the lag plot, explaining why we see these specific autocorrelation patterns for lags 1, 6, and 12.
Autocorrelation assesses how well the values of a variable correlate with their own previous values
The temperatures in adjacent months are similar, showing a significant positive correlation. In Figure 6, with a time difference of six months, the seasons are opposite; winter and summer temperatures are opposites, situated on the outermost sides, while autumn and spring temperatures are similar and closer. In Figure 12, with a time difference of 12 months, representing a complete cycle of the four seasons, the seasons are the same, and the temperatures are almost identical, displaying a strong positive correlation.
data_tsibble %>%
gg_lag(Temperature, lags = 1:12, geom = "point", alpha = 0.5) +
labs(title = "Lag plot")
- STATS 786 only 10 Marks
- In this question, you will recreate the lag plot from (3).
- Instead of using
gg_lag, you will use functions from thedplyrandggplot2packages to create your own lag plot. You may also need functions fromlubridate,forcats, andtidyr.- Try to get your plot as close to the
gg_lagplot you produced in (3). You will get full marks if your plot is exactly the same as what you get withgg_lag. Marks will be deducted for inconsistencies.
- Note: There are many ways to solve this problem, but here are some things you may want to consider: how to create lagged variables in your data set, how to name them, how to convert your data set into a long format, and how to facet your plot.
data_tsibble_lag <- data_tsibble %>%
mutate(month_name = month(Month,label = TRUE)) %>%
mutate(Value_lag1 = lag(Temperature, 1)) %>%
mutate(Value_lag2 = lag(Temperature, 2)) %>%
mutate(Value_lag3 = lag(Temperature, 3)) %>%
mutate(Value_lag4 = lag(Temperature, 4)) %>%
mutate(Value_lag5 = lag(Temperature, 5)) %>%
mutate(Value_lag6 = lag(Temperature, 6)) %>%
mutate(Value_lag7 = lag(Temperature, 7)) %>%
mutate(Value_lag8 = lag(Temperature, 8)) %>%
mutate(Value_lag9 = lag(Temperature, 9)) %>%
mutate(Value_lag10 = lag(Temperature, 10)) %>%
mutate(Value_lag11 = lag(Temperature, 11)) %>%
mutate(Value_lag12 = lag(Temperature, 12))
data_tsibble_lag <- pivot_longer(data_tsibble_lag,
cols = Value_lag1 :Value_lag12,
names_to = "Lag",
values_to = "Lag_value",
values_drop_na = TRUE)
data_tsibble_lag$Lag <- factor(data_tsibble_lag$Lag,
levels = c("Value_lag1","Value_lag2","Value_lag3", "Value_lag4","Value_lag5","Value_lag6","Value_lag7","Value_lag8","Value_lag9","Value_lag10","Value_lag11","Value_lag12"),
labels = c("lag 1","lag 2","lag 3", "lag 4","lag 5","lag 6","lag 7","lag 8","lag 9","lag 10","lag 11","lag 12"))
data_tsibble_lag %>%
ggplot(aes(x = Lag_value, y = Temperature, color = month_name)) +
geom_point(alpha = 0.5) +
facet_wrap(~Lag) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
labs(title = "Lag plot",
x = "lag(Temperature, n)",
y = "Temperature",
color = "season") +
theme(aspect.ratio = 1)
Total possible marks for Problem 2: 20 Marks for 326 30 Marks for 786
Total possible marks for Assignment 1: 65 Marks for 326 75 Marks for 786